In our upcoming project, we aim to create a Bayesian linear regression model to predict calorie burn during different exercises. One of the main challenges we anticipate is the scarcity of high-quality prior data on exercise-related calorie expenditure. Obtaining accurate and representative priors is crucial for the Bayesian approach, and the lack of robust prior information could lead to unreliable predictions. Incorporating the dynamic nature of individual responses to exercise poses another hurdle. Addressing these challenges is essential for the success of our project, as it directly affects the model's ability to provide meaningful and accurate predictions for diverse exercise scenarios.
Selection of the Data
We decided to choose a data set with almost 15,000 rows and only 11 columns, as we wanted to ensure a robust basis for capturing the complexity of the relationships between predictor variables and caloric expenditure during exercise. The more observations we have, the more robust our conclusions will be and the more reliable our model predictions will be.
Before embarking on this project, we immersed ourselves in a thorough research phase of the problem. Not only did we review the information from the data set, but we also made sure to obtain first-hand information by contacting physical trainers. Their experience and perspectives have proven to be key to understanding the factors that impact caloric expenditure during exercise.
Since one of our colleagues is part of the swimming coaches of Jalisco, this research has a personal and practical meaning for us. We directly understand the importance of optimizing the performance and health of our athletes. Not only does this project have the potential to benefit the community at large, but it can also directly transform the way we plan training routines for our swimmers. The ability to more accurately predict caloric expenditure during exercise can translate into more personalized and effective training plans, thereby improving the overall performance and well-being of the athletes we have the privilege to train.
In summary, the choice of the extensive data set, detailed research and collaboration with professionals in the field, combined with our personal connection as swim coaches, support the strength and relevance of this project. It is not only an academic endeavor, but an initiative that can have a direct and positive impact on our society.
link:
Explanation of the Data of Interest: The dataset used in this project contains relevant information for predicting calorie expenditure during exercise. The included variables are:
These data provide a wide range of information that can influence the number of calories burned during exercise, which allows us to build both a Bayesian linear regression model and a Bayesian logistic regression model.
The justification for this project is grounded in the increasing importance of health and well-being in today's society. With the growing interest in physical activity and awareness of the importance of maintaining a healthy lifestyle, the ability to predict calorie expenditure during exercise can be of great utility for individuals of all ages.
The Bayesian linear regression approach provides a robust tool for modeling the relationship between various predictor variables, such as age, gender, exercise duration, heart rate, body temperature, among others, and the variable of interest, which in this case is the amount of calories burned. Through this model, we can obtain more accurate and useful predictions to help individuals plan their exercise routines more effectively.
The practical application of this project lies in its ability to provide users with a more accurate estimate of the calories they will burn during their exercise sessions. This can be especially valuable for personalized routine planning and fitness goals.
Routine Optimization: Users can tailor their exercise routines to efficiently achieve specific calorie-burning goals.
Performance Improvement: The ability to foresee the amount of calories burned allows users to adjust exercise intensity and duration to maximize results.
Personalization: By taking individual factors such as age, gender, and other variables into account, the model offers more personalized predictions, providing users with more relevant information for their specific needs.
This project is motivated by the increasing emphasis on health and fitness, where individuals often set specific calorie-burning goals for their exercise routines. The classification aspect of the project is aimed at determining whether an exercise session successfully achieved the stated goal of burning, for example, 300 or more calories. This is particularly relevant in the context of personalized fitness plans, where users strive to monitor and attain specific calorie-burning milestones.
The practical application of this classification project lies in its ability to assess whether an exercise session achieved its intended calorie-burning goal. The benefits include:
Goal Evaluation: Users can receive feedback on whether their exercise sessions met the targeted calorie-burning objectives.
Adaptation of Workouts: Individuals can adjust and tailor their workout routines based on the classification results to better align with their fitness goals.
Objective Monitoring: The model provides a systematic approach to monitor and evaluate progress toward specific calorie-burning targets.
The project was carried out by the university team formed by Maria Paula Perez Romo, Dafne Tamayo Leon and Patricio Villanueva Gio. Their GitHub links are the following:
We also thank the course instructor, Esteban Jimenez Rodriguez, whose GitHub profile is: https://github.com/esjimenezro
import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
# Matplotlib, seaborn and arviz for visualization
import matplotlib.pyplot as plt
import seaborn as sns
import arviz as az
%matplotlib inline
# Linear Regression to verify implementation and BayesianRidge to build the Bayesian model
from sklearn.linear_model import LinearRegression, BayesianRidge, Lasso
# Scipy for statistics
import scipy
from scipy.stats import norm, uniform
# Train-Test and mean squared error
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import RFE
from sklearn.decomposition import PCA
# PyMC
import pymc as pm
# Load exercise data from 'exercise.csv'
exercise = pd.read_csv('exercise.csv')
# Load calories data from 'calories.csv'
calories = pd.read_csv('calories.csv')
# Merge exercise and calories dataframes on 'User_ID'
df = pd.merge(exercise, calories, on='User_ID')
# Filter rows where 'Calories' is less than 300
df = df[df['Calories'] < 300]
# Reset the index of the dataframe
df = df.reset_index()
# Add a new column 'Intercept' with constant value 1
df['Intercept'] = 1
# Display the first few rows of the dataframe
df.head()
| | index | User_ID | Gender | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories | Intercept |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 14733363 | male | 68 | 190.0 | 94.0 | 29.0 | 105.0 | 40.8 | 231.0 | 1 |
| 1 | 1 | 14861698 | female | 20 | 166.0 | 60.0 | 14.0 | 94.0 | 40.3 | 66.0 | 1 |
| 2 | 2 | 11179863 | male | 69 | 179.0 | 79.0 | 5.0 | 88.0 | 38.7 | 26.0 | 1 |
| 3 | 3 | 16180408 | female | 34 | 179.0 | 71.0 | 13.0 | 100.0 | 40.5 | 71.0 | 1 |
| 4 | 4 | 17771927 | female | 27 | 154.0 | 58.0 | 10.0 | 81.0 | 39.8 | 35.0 | 1 |
Data cleaning
mapeo = {'male':0, 'female':1}
df['Gender'] = df['Gender'].map(mapeo)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14998 entries, 0 to 14997
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   index       14998 non-null  int64
 1   User_ID     14998 non-null  int64
 2   Gender      14998 non-null  int64
 3   Age         14998 non-null  int64
 4   Height      14998 non-null  float64
 5   Weight      14998 non-null  float64
 6   Duration    14998 non-null  float64
 7   Heart_Rate  14998 non-null  float64
 8   Body_Temp   14998 non-null  float64
 9   Calories    14998 non-null  float64
 10  Intercept   14998 non-null  int64
dtypes: float64(6), int64(5)
memory usage: 1.3 MB
Understanding data: max-min and the importance of each variable
Data max-Min
# Print the maximum and minimum of every column
for column in df.columns:
    print(column)
    print(f'max:{df[column].max()} min:{df[column].min()}')
    print("")
index
max:14999 min:0

User_ID
max:19999647 min:10001159

Gender
max:1 min:0

Age
max:79 min:20

Height
max:222.0 min:123.0

Weight
max:132.0 min:36.0

Duration
max:30.0 min:1.0

Heart_Rate
max:128.0 min:67.0

Body_Temp
max:41.5 min:37.1

Calories
max:295.0 min:1.0

Intercept
max:1 min:1
Variable Selection
Methods:
Pearson Correlation
correlation_matrix = df.corr()
threshold = 0.8
high_corr_variables = set()
# Find the highly correlated variables
for i in range(len(correlation_matrix.columns)):
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > threshold:
            colname = correlation_matrix.columns[i]
            high_corr_variables.add(colname)
# Convert the set to a list if preferred
high_corr_variables_list = list(high_corr_variables)
# Print the highly correlated variables
print("Highly correlated variables:", high_corr_variables_list)
Highly correlated variables: ['Body_Temp', 'Weight', 'Calories', 'Heart_Rate']
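The loop above stores only one column from each highly correlated pair, so the printed list does not say which variables are correlated with each other. A minimal sketch that lists the pairs explicitly is shown below. The helper name `high_corr_pairs` and the toy DataFrame are hypothetical and are not part of the project data.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame: pd.DataFrame, threshold: float = 0.8):
    """Return (col_a, col_b, r) for every pair with |Pearson r| above threshold."""
    corr = frame.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i):  # lower triangle only, so each pair appears once
            r = corr.iloc[i, j]
            if abs(r) > threshold:
                pairs.append((cols[j], cols[i], float(r)))
    return pairs


# Hypothetical toy data: 'b' is almost a multiple of 'a', 'c' is unrelated noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2.0 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
pairs = high_corr_pairs(toy)
for left, right, r in pairs:
    print(f"{left} ~ {right}: r = {r:.3f}")
```

In the notebook, the same helper could be called as `high_corr_pairs(df)` to see, for example, whether `Weight` is flagged because of `Height` or because of another column.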
Feature Importance
X_proposal = df.drop('Calories', axis=1)
y_proposal = df['Calories']
lasso_model = Lasso(alpha=0.1)
lasso_model.fit(X_proposal, y_proposal)
# Get feature importance
feature_importance = np.abs(lasso_model.coef_)
# Select variables whose coefficient is greater than 0.8
significant_variables = X_proposal.columns[feature_importance > 0.8]
# Print significant variables
print("Significant variables:", significant_variables)
Significant variables: Index(['Duration', 'Heart_Rate', 'Body_Temp'], dtype='object')
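Lasso coefficients depend on the units of each feature, so a fixed cutoff such as 0.8 on raw coefficients tends to favour variables measured on small scales. The sketch below illustrates this on synthetic data and shows one common remedy, standardizing the features before fitting. The variables `x_large` and `x_small` and all the numbers are assumptions made for illustration, not values from the exercise dataset.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
x_large = rng.normal(scale=100.0, size=n)  # feature measured in large units
x_small = rng.normal(scale=1.0, size=n)    # feature measured in small units
X = np.column_stack([x_large, x_small])
# Both features contribute equally to y (each term has standard deviation 5).
y = 0.05 * x_large + 5.0 * x_small + rng.normal(scale=1.0, size=n)

# Lasso on the raw features: coefficient size reflects the units.
raw_coef = np.abs(Lasso(alpha=0.1).fit(X, y).coef_)

# Lasso on standardized features: coefficient size reflects the contribution.
scaled = make_pipeline(StandardScaler(), Lasso(alpha=0.1)).fit(X, y)
scaled_coef = np.abs(scaled.named_steps["lasso"].coef_)

print("raw coefficients:         ", raw_coef)
print("standardized coefficients:", scaled_coef)
```

With the raw features, the 0.8 cutoff would discard `x_large` even though it matters as much as `x_small`. After standardization both coefficients are comparable.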
RFE
# Create a linear regression model
model = LinearRegression()
# Select features using RFE
rfe = RFE(model, n_features_to_select=3) # Adjust the desired number of features
fit = rfe.fit(X_proposal, y_proposal)
# Selected variables
selected_variables = X_proposal.columns[fit.support_]
print("selected_variables:",selected_variables)
selected_variables: Index(['Gender', 'Duration', 'Body_Temp'], dtype='object')
Analysis of the data extracted from our research with bewellness gym trainers
plt.figure(figsize=(8, 8))
# Plot 'Duration' against 'Calories' using magenta circles ('mo')
plt.plot(df['Duration'], df['Calories'], 'mo')
plt.xlabel('Duration (min)', size=18)
plt.ylabel('Calories', size=18)
plt.title('Calories burned vs Duration of Exercise', size=20)
plt.show()
One of the most common and useful questions is how many calories are burned given only the exercise time. A model based on duration alone lets us give a generic answer in scenarios where we do not want to investigate the characteristics of the individual or the details of the training.
X1 = df.loc[:, ['Intercept', 'Duration']]
y1 = df.loc[:, 'Calories']
X1.head()
| | Intercept | Duration |
|---|---|---|
| 0 | 1 | 29.0 |
| 1 | 1 | 14.0 |
| 2 | 1 | 5.0 |
| 3 | 1 | 13.0 |
| 4 | 1 | 10.0 |
y1.head()
0    231.0
1     66.0
2     26.0
3     71.0
4     35.0
Name: Calories, dtype: float64
Ordinary Least squares linear regression by hand
def linear_regression(X, y):
    # Calculate the coefficients using the formula: (X^T * X)^-1 * X^T * y
    _coeffs = np.matmul(np.matmul(np.linalg.inv(np.matmul(X.T, X)), X.T), y)
    return _coeffs
# Run the by hand implementation
by_hand_coefs = linear_regression(X1, y1)
import numpy as np
import matplotlib.pyplot as plt
# Generate a range of x values for the regression line
xs = np.linspace(4, 31, 1000)
# Calculate y values for the regression line using the coefficients obtained by hand
ys = by_hand_coefs[0] + by_hand_coefs[1] * xs
plt.figure(figsize=(8, 8))
plt.plot(df['Duration'], df['Calories'], 'bo', label='observations', alpha=0.8)
plt.xlabel('Duration (min)', size=18)
plt.ylabel('Calories', size=18)
# Plot the regression line as a red dashed line ('r--')
plt.plot(xs, ys, 'r--', label='OLS Fit', linewidth=3)
plt.legend(prop={'size': 16})
plt.title('Calories burned vs Duration of Exercise', size=20)
Text(0.5, 1.0, 'Calories burned vs Duration of Exercise')
Graphically, the fit looks good. As expected, exercise duration is strongly related to calories burned, so this simple linear regression already predicts the target variable reasonably well.
Prediction of a data point: we predict the calories burned from the number of minutes spent on a generic exercise, that is, we pass the duration as the only attribute.
print('Exercising for 15.5 minutes will burn an estimated {:.2f} calories.'.format(
by_hand_coefs[0] + by_hand_coefs[1] * 15.5))
Exercising for 15.5 minutes will burn an estimated 89.30 calories.
Verify with Scikit-learn Implementation
linear_regression = LinearRegression()
linear_regression.fit(np.array(X1.Duration).reshape(-1,1),y1)
LinearRegression()
new_duration_value = 15.5
# Reshape the new duration value as the model expects a 2D array
new_duration_value_reshape = np.array(new_duration_value).reshape(-1, 1)
# Use the trained model to make the prediction
prediction = linear_regression.predict(new_duration_value_reshape)
print("The prediction for the new duration value is:", prediction)
The prediction for the new duration value is: [89.30353939]
As we can see, both implementations return the same result (89.30 calories), which confirms that the by-hand linear regression was implemented correctly.
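The by-hand formula inverts X^T X explicitly. A least-squares solver gives the same coefficients without forming the inverse, which is generally more stable numerically. The sketch below compares the two on hypothetical toy data that stands in for the `Intercept` and `Duration` columns. The numbers are made up for illustration.

```python
import numpy as np

# Hypothetical toy data standing in for the Intercept and Duration columns.
rng = np.random.default_rng(42)
duration = rng.uniform(1, 30, size=200)
calories = -20.0 + 7.0 * duration + rng.normal(scale=5.0, size=200)

# Design matrix with an explicit intercept column, as in X1 above.
X = np.column_stack([np.ones_like(duration), duration])

# Normal equations with an explicit inverse (the by-hand formula).
coef_inverse = np.linalg.inv(X.T @ X) @ X.T @ calories

# Least-squares solver: same estimate without forming the inverse.
coef_lstsq, *_ = np.linalg.lstsq(X, calories, rcond=None)

print("inverse:", coef_inverse)
print("lstsq:  ", coef_lstsq)
```

For a two-column design matrix like this one the difference is negligible. It matters more when the predictors are strongly correlated.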
Bayesian Linear Regression (only duration)
# Define the features (X) and the target variable (y)
X = df[['Duration']]
y = df['Calories']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the Bayesian Ridge regression model
regressor = BayesianRidge()
regressor.fit(X_train, y_train)
# Make predictions on the test set
y_pred = regressor.predict(X_test)
# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared (R²): {r2}')
# Plot the results
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red')
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Comparison between Actual Values and Predictions')
plt.show()
Mean Squared Error: 335.65596279905805
R-squared (R²): 0.9127566471888006
Next, we use the fitted Bayesian ridge model to predict the same 15.5-minute session and check whether the result is similar to the values obtained with ordinary linear regression.
new_data = np.array([15.5])
# Reshape the new data as the model expects a 2D array
new_data_reshape = new_data.reshape(1, -1)
# Make a prediction on the new data
prediction_new_data = regressor.predict(new_data_reshape)
# Print the prediction
print("Prediction for the new data:", prediction_new_data)
Prediction for the new data: [89.37148851]
As we can see, the result (89.37) is close to the previous ones (89.30). Taking the ordinary linear regression as a benchmark, the Bayesian ridge regression gives a consistent prediction.
Bayesian Linear Regression (Only Duration PyMC)
# Candidate (mean, standard deviation) pairs for the prior on mu
proposals = [[40, 20], [60, 40], [89, 40]]
for loc, scale in proposals:
    prior_mu = norm(loc=loc, scale=scale)
    prior_sigma = uniform(loc=0, scale=20)
    N = 1000
    samples_mu = prior_mu.rvs(size=N)
    samples_sigma = prior_sigma.rvs(size=N)
    # Prior predictive draws of calories burned
    samples_calories = norm.rvs(loc=samples_mu, scale=samples_sigma)
    az.plot_kde(samples_calories, bw=5)
    plt.show()
These plots show the prior predictive distribution of calories burned, that is, the values the model considers plausible before observing any data. Because sigma is itself drawn from a distribution, the result is not exactly Gaussian. Prior predictive simulation is a useful tool for judging whether a prior is adequate, since it exposes problems such as priors that give substantial probability to impossible values.
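One way to make the prior check quantitative is to simulate from each candidate prior and measure how often it produces impossible values, such as negative calories. The following is a minimal sketch using NumPy only. The helper `prior_predictive` is hypothetical, and the three (mean, standard deviation) pairs are the ones proposed in the loop above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_draws = 10_000


def prior_predictive(mu_loc, mu_scale, sigma_upper):
    """Draw calories from Normal(mu, sigma), mu ~ Normal, sigma ~ Uniform(0, sigma_upper)."""
    mu = rng.normal(mu_loc, mu_scale, size=n_draws)
    sigma = rng.uniform(0.0, sigma_upper, size=n_draws)
    return rng.normal(mu, sigma)


share_negative = {}
for mu_loc, mu_scale in [(40, 20), (60, 40), (89, 40)]:
    sims = prior_predictive(mu_loc, mu_scale, sigma_upper=20)
    share_negative[(mu_loc, mu_scale)] = float(np.mean(sims < 0))
    low, high = np.percentile(sims, [5.5, 94.5])
    print(f"mu ~ N({mu_loc}, {mu_scale}): "
          f"89% interval [{low:.0f}, {high:.0f}], "
          f"share below zero = {share_negative[(mu_loc, mu_scale)]:.1%}")
```

A prior that puts a noticeable share of its mass below zero is not necessarily wrong, but it is a sign that a distribution restricted to positive values might describe calories better.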
# We define the model
with pm.Model() as Calories_model:
    mu = pm.Normal(
        name="mu",
        mu=170,
        sigma=20
    )
    sigma = pm.Uniform(
        name="sigma",
        lower=0,
        upper=50
    )
    calories_obs = pm.Normal(
        name="Calories",
        mu=mu,
        sigma=sigma,
        observed=df["Calories"]
    )
# Sample from the posterior
with Calories_model:
    Calories_idata = pm.sample()
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [mu, sigma]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 22 seconds. There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
# idata object
Calories_idata
<xarray.Dataset>
Dimensions: (chain: 4, draw: 1000)
Coordinates:
* chain (chain) int32 0 1 2 3
* draw (draw) int32 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
mu (chain, draw) float64 89.29 89.62 89.27 89.11 ... 88.76 89.6 89.49
sigma (chain, draw) float64 50.0 50.0 49.99 50.0 ... 50.0 50.0 49.98 50.0
Attributes:
created_at: 2023-11-29T05:04:21.025827
arviz_version: 0.16.1
inference_library: pymc
inference_library_version: 5.9.0
sampling_time: 22.11118173599243
    tuning_steps: 1000
<xarray.Dataset>
Dimensions: (chain: 4, draw: 1000)
Coordinates:
* chain (chain) int32 0 1 2 3
* draw (draw) int32 0 1 2 3 4 5 ... 994 995 996 997 998 999
Data variables: (12/17)
step_size (chain, draw) float64 0.8614 0.8614 ... 0.6875 0.6875
energy (chain, draw) float64 8.416e+04 ... 8.416e+04
process_time_diff (chain, draw) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
largest_eigval (chain, draw) float64 nan nan nan nan ... nan nan nan
reached_max_treedepth (chain, draw) bool False False False ... False False
n_steps (chain, draw) float64 1.0 3.0 7.0 1.0 ... 3.0 3.0 7.0
... ...
acceptance_rate (chain, draw) float64 1.0 0.9737 ... 0.7283 0.7615
tree_depth (chain, draw) int64 1 2 3 1 2 1 1 2 ... 1 2 2 2 2 2 3
smallest_eigval (chain, draw) float64 nan nan nan nan ... nan nan nan
perf_counter_start (chain, draw) float64 8.264e+04 ... 8.264e+04
lp (chain, draw) float64 -8.416e+04 ... -8.416e+04
index_in_trajectory (chain, draw) int64 -1 -1 -1 1 -2 1 ... 2 2 2 -2 -3
Attributes:
created_at: 2023-11-29T05:04:21.049344
arviz_version: 0.16.1
inference_library: pymc
inference_library_version: 5.9.0
sampling_time: 22.11118173599243
    tuning_steps: 1000
<xarray.Dataset>
Dimensions: (Calories_dim_0: 14998)
Coordinates:
* Calories_dim_0 (Calories_dim_0) int32 0 1 2 3 4 ... 14994 14995 14996 14997
Data variables:
Calories (Calories_dim_0) float64 231.0 66.0 26.0 ... 75.0 11.0 98.0
Attributes:
created_at: 2023-11-29T05:04:21.058265
arviz_version: 0.16.1
inference_library: pymc
    inference_library_version: 5.9.0
# az.plot_trace
az.plot_trace(Calories_idata)
array([[<Axes: title={'center': 'mu'}>, <Axes: title={'center': 'mu'}>],
[<Axes: title={'center': 'sigma'}>,
<Axes: title={'center': 'sigma'}>]], dtype=object)
# az.summary
az.summary(
Calories_idata,
kind="stats",
hdi_prob=0.89
)
| | mean | sd | hdi_5.5% | hdi_94.5% |
|---|---|---|---|---|
| mu | 89.540 | 0.401 | 88.926 | 90.209 |
| sigma | 49.994 | 0.006 | 49.987 | 50.000 |
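In the summary above, the posterior mean of sigma (49.994) sits almost exactly at the upper limit of its Uniform(0, 50) prior. This usually suggests that the prior bound, rather than the data, is limiting the estimate. A minimal sketch of a check for this situation follows. The helper `piles_up_at_bound` and the simulated draws are hypothetical and only mimic the numbers in the table.

```python
import numpy as np


def piles_up_at_bound(draws, upper, tol=0.01):
    """Return True when most draws lie within tol * upper of the prior's upper bound."""
    draws = np.asarray(draws, dtype=float)
    near_bound = draws > upper * (1.0 - tol)
    return bool(near_bound.mean() > 0.5)


# Hypothetical draws that mimic the summary table (mean ~49.99, sd ~0.006).
rng = np.random.default_rng(1)
sigma_draws = 50.0 - rng.exponential(scale=0.006, size=4000)

print(piles_up_at_bound(sigma_draws, upper=50.0))
```

In the notebook, the draws could be taken from `Calories_idata.posterior["sigma"].values.ravel()`. If the check returns True, a wider prior for sigma is worth trying before interpreting the result.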
duration = df["Duration"]
# Center the predictor so that `a` is the expected calories at the mean duration
d_bar = duration.mean()
with pm.Model() as calories_model_predictive:
    sigma = pm.Uniform("sigma", 0, 50)
    a = pm.Normal("a", 170, 20)
    b = pm.LogNormal("b", 0, 1)
    mu = a + b * (duration - d_bar)
    Calories = pm.Normal("Calories", mu, sigma, observed=df["Calories"])
    Calories_pred_idata = pm.sample()
Auto-assigning NUTS sampler... Initializing NUTS using jitter+adapt_diag... Multiprocess sampling (4 chains in 4 jobs) NUTS: [sigma, a, b]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 27 seconds.
# Distribution of parameters
az.summary(
Calories_pred_idata,
kind="stats",
hdi_prob=0.89
)
| | mean | sd | hdi_5.5% | hdi_94.5% |
|---|---|---|---|---|
| a | 89.515 | 0.156 | 89.267 | 89.754 |
| sigma | 18.386 | 0.106 | 18.216 | 18.555 |
| b | 7.170 | 0.018 | 7.141 | 7.199 |
az.plot_trace(Calories_pred_idata)
array([[<Axes: title={'center': 'a'}>, <Axes: title={'center': 'a'}>],
[<Axes: title={'center': 'sigma'}>,
<Axes: title={'center': 'sigma'}>],
[<Axes: title={'center': 'b'}>, <Axes: title={'center': 'b'}>]],
dtype=object)
Calories_pred_idata
<xarray.Dataset>
Dimensions: (chain: 4, draw: 1000)
Coordinates:
* chain (chain) int32 0 1 2 3
* draw (draw) int32 0 1 2 3 4 5 6 7 8 ... 992 993 994 995 996 997 998 999
Data variables:
a (chain, draw) float64 89.44 89.51 89.47 89.68 ... 89.37 89.68 89.7
sigma (chain, draw) float64 18.37 18.23 18.25 18.3 ... 18.4 18.33 18.31
b (chain, draw) float64 7.2 7.141 7.14 7.199 ... 7.185 7.155 7.182
Attributes:
created_at: 2023-11-29T05:04:53.691386
arviz_version: 0.16.1
inference_library: pymc
inference_library_version: 5.9.0
sampling_time: 27.27321720123291
    tuning_steps: 1000
<xarray.Dataset>
Dimensions: (chain: 4, draw: 1000)
Coordinates:
* chain (chain) int32 0 1 2 3
* draw (draw) int32 0 1 2 3 4 5 ... 994 995 996 997 998 999
Data variables: (12/17)
step_size (chain, draw) float64 1.311 1.311 ... 0.9885 0.9885
energy (chain, draw) float64 6.497e+04 ... 6.497e+04
process_time_diff (chain, draw) float64 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0
largest_eigval (chain, draw) float64 nan nan nan nan ... nan nan nan
reached_max_treedepth (chain, draw) bool False False False ... False False
n_steps (chain, draw) float64 3.0 3.0 1.0 3.0 ... 3.0 3.0 3.0
... ...
acceptance_rate (chain, draw) float64 0.828 0.886 ... 0.8583 0.8144
tree_depth (chain, draw) int64 2 2 1 2 2 2 2 3 ... 2 2 2 1 2 2 2
smallest_eigval (chain, draw) float64 nan nan nan nan ... nan nan nan
perf_counter_start (chain, draw) float64 8.267e+04 ... 8.267e+04
lp (chain, draw) float64 -6.497e+04 ... -6.497e+04
index_in_trajectory (chain, draw) int64 1 -2 -1 2 1 -2 ... -1 -1 -2 3 2
Attributes:
created_at: 2023-11-29T05:04:53.708155
arviz_version: 0.16.1
inference_library: pymc
inference_library_version: 5.9.0
sampling_time: 27.27321720123291
    tuning_steps: 1000
<xarray.Dataset>
Dimensions: (Calories_dim_0: 14998)
Coordinates:
* Calories_dim_0 (Calories_dim_0) int32 0 1 2 3 4 ... 14994 14995 14996 14997
Data variables:
Calories (Calories_dim_0) float64 231.0 66.0 26.0 ... 75.0 11.0 98.0
Attributes:
created_at: 2023-11-29T05:04:53.715897
arviz_version: 0.16.1
inference_library: pymc
    inference_library_version: 5.9.0
posterior_df = Calories_pred_idata.posterior.to_dataframe()
posterior_df.head()
| chain | draw | a | sigma | b |
|---|---|---|---|---|
| 0 | 0 | 89.441722 | 18.370709 | 7.199831 |
| 0 | 1 | 89.506148 | 18.225344 | 7.140684 |
| 0 | 2 | 89.466536 | 18.248159 | 7.139838 |
| 0 | 3 | 89.680035 | 18.303279 | 7.198502 |
| 0 | 4 | 89.625499 | 18.379848 | 7.182089 |
We test the same value as before, a 15.5-minute session (15 minutes and 30 seconds), to compare the result with that of the other models.
samples = np.random.randint(low=0, high=4000, size=500)
sampled_post = posterior_df.iloc[samples]
post_mu = sampled_post["a"].values + sampled_post["b"].values * (
df["Duration"].values.reshape(-1, 1)
- df["Duration"].mean()
)
# mu at 15.5 (Prediction)
mu_at_15 = sampled_post["a"].values + sampled_post["b"].values * (
15.5 - df["Duration"].mean()
)
print('number of tests',mu_at_15.shape[0])
number of tests 500
print(f'the average of the {mu_at_15.shape[0]} evaluations for the {15.5} minutes workout is:{mu_at_15.mean()}')
the average of the 500 evaluations for the 15.5 minutes workout is:89.30825915407148
As we can see, it returns a very similar value to both the linear regression and the Bayesian linear regression performed with the sklearn library.
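Beyond the average, the posterior draws also describe how uncertain the prediction is. The sketch below computes an 89% percentile interval for the mean at 15.5 minutes and for a single new session, which also includes the noise term sigma. The draws are simulated to resemble the summary table above. They are hypothetical, the value of `d_bar` is an assumption, and a percentile interval is used here in place of the HDI reported by ArviZ.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 4000

# Hypothetical posterior draws, generated to resemble the summary table above
# (a ~ 89.5 +/- 0.16, b ~ 7.17 +/- 0.018, sigma ~ 18.4 +/- 0.11).
a = rng.normal(89.515, 0.156, size=n)
b = rng.normal(7.170, 0.018, size=n)
sigma = rng.normal(18.386, 0.106, size=n)
d_bar = 15.53  # assumed mean duration; in the notebook use df["Duration"].mean()

new_duration = 15.5
mu = a + b * (new_duration - d_bar)   # uncertainty about the mean
calories = rng.normal(mu, sigma)      # uncertainty about one new session

mu_low, mu_high = np.percentile(mu, [5.5, 94.5])
pred_low, pred_high = np.percentile(calories, [5.5, 94.5])
print(f"89% interval for the mean:         [{mu_low:.1f}, {mu_high:.1f}]")
print(f"89% interval for a single session: [{pred_low:.1f}, {pred_high:.1f}]")
```

In the notebook, `a`, `b` and `sigma` could be taken from the columns of `posterior_df` instead of being simulated. The interval for a single session is much wider than the interval for the mean because individual sessions vary around the regression line.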
Bayesian Linear Regression (all data)
We now feed the model all the selected variables to build the Bayesian model and evaluate its effectiveness.
# Define the features (X) and the target variable (y)
X = df[['Age', 'Height', 'Weight', 'Duration', 'Heart_Rate', 'Body_Temp']]
y = df['Calories']
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the Bayesian Ridge regression model
regressor = BayesianRidge()
regressor.fit(X_train, y_train)
# Make predictions on the test set
y_pred = regressor.predict(X_test)
# Evaluate the model performance
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse}')
print(f'R-squared (R²): {r2}')
# Plot the results
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red')
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Comparison between Actual Values and Predictions')
plt.show()
Mean Squared Error: 129.79387499766082
R-squared (R²): 0.9662641094329892
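Because the MSE is expressed in squared calories, its square root (the RMSE) is easier to interpret: it is the typical size of a prediction error in calories. The short calculation below converts the two MSE values reported in this notebook.

```python
import math

mse_duration_only = 335.65596279905805  # reported above for the Duration-only model
mse_all_features = 129.79387499766082   # reported above for the six-feature model

rmse_duration_only = math.sqrt(mse_duration_only)
rmse_all_features = math.sqrt(mse_all_features)

print(f"RMSE, duration only: {rmse_duration_only:.2f} calories")
print(f"RMSE, all features:  {rmse_all_features:.2f} calories")
```

Adding the other five features reduces the typical error from roughly 18 calories to roughly 11 calories.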
# Concatenate X_train and y_train to have a complete DataFrame
train_data = pd.concat([X_train, y_train], axis=1)
# Visualize the relationship of each feature with the target variable
plt.figure(figsize=(15, 10))
for i, column in enumerate(train_data.columns[:-1]):
    plt.subplot(3, 2, i+1)
    sns.scatterplot(x=column, y='Calories', data=train_data)
    plt.title(f'Relationship between {column} and Calories')
plt.tight_layout()
plt.show()
residuals = y_test - y_pred
# Plot a histogram of residuals
plt.hist(residuals, bins=30, edgecolor='black')
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.title('Residuals Histogram')
plt.show()
plt.scatter(y_pred, residuals)
plt.axhline(y=0, color='red', linestyle='--')
plt.xlabel('Predictions')
plt.ylabel('Residuals')
plt.title('Residual Scatter Plot')
plt.show()
plt.scatter(y_test, y_pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], linestyle='--', color='red')
plt.xlabel('Actual Values')
plt.ylabel('Predictions')
plt.title('Fit Plot')
plt.show()
new_data = np.array([19, 159, 55, 25, 118, 38.5])
# Reshape the new data as the model expects a 2D array
new_data_reshape = new_data.reshape(1, -1)
# Make a prediction on the new data
prediction_new_data = regressor.predict(new_data_reshape)
# Print the prediction
print("Prediction for the new data:", prediction_new_data)
Prediction for the new data: [208.25165528]
For a more realistic comparison, Patricio used this model to predict the calories burned in a workout he did on 11/26/2023. He is a 19-year-old male, 1.58 m tall, who swam for exactly 25 minutes. The result shown by his calorie meter was
new_data = np.array([19,163,55,68,140,44])
# Reshape the new data as the model expects a 2D array
new_data_reshape = new_data.reshape(1, -1)
# Make a prediction on the new data
prediction_new_data = regressor.predict(new_data_reshape)
# Print the prediction
print("Prediction for the new data:", prediction_new_data)
Prediction for the new data: [443.9203083]
For a more realistic comparison, Maria Paula used this model to predict the calories burned in a workout she did on 11/06/2023. She is a 19-year-old woman, 1.63 m tall, who trained football at the ITESO fields for 68 minutes. This is the result recorded by her Apple Watch.
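A linear model is only checked against the range of values it was fitted on, so it helps to test whether a new input lies inside that range before trusting the prediction. The sketch below does this for the second input used above. The ranges are copied from the max-min listing earlier in the notebook, and the helper name `out_of_range` is hypothetical.

```python
# Observed (min, max) per feature, copied from the max-min listing earlier in the notebook.
OBSERVED_RANGE = {
    "Age": (20, 79),
    "Height": (123.0, 222.0),
    "Weight": (36.0, 132.0),
    "Duration": (1.0, 30.0),
    "Heart_Rate": (67.0, 128.0),
    "Body_Temp": (37.1, 41.5),
}


def out_of_range(sample: dict) -> dict:
    """Return the features whose value falls outside the observed range."""
    flagged = {}
    for name, value in sample.items():
        low, high = OBSERVED_RANGE[name]
        if not low <= value <= high:
            flagged[name] = (value, low, high)
    return flagged


# The second input used above: [19, 163, 55, 68, 140, 44].
session = dict(zip(OBSERVED_RANGE, [19, 163, 55, 68, 140, 44]))
flagged = out_of_range(session)
for name, (value, low, high) in flagged.items():
    print(f"{name} = {value} is outside the observed range [{low}, {high}]")
```

For this input the duration, heart rate and body temperature are all above the largest values in the dataset, so the prediction of about 444 calories is an extrapolation and should be read with caution.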
import warnings
# Ignore all warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import BayesianRidge
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
# Load exercise data from 'exercise.csv'
exercise = pd.read_csv('exercise.csv')
# Load calories data from 'calories.csv'
calories = pd.read_csv('calories.csv')
# Merge exercise and calories dataframes on 'User_ID'
dfc = pd.merge(exercise, calories, on='User_ID')
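# Reuse the cleaned dataframe `df` from the regression section (Calories < 300,
# Gender already encoded as 0/1); note that this replaces the merged dataframe above
dfc = df.reset_index()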
# Min calories
min_calories = 150
# Display the first few rows of the dataframe
dfc.head()
| level_0 | index | User_ID | Gender | Age | Height | Weight | Duration | Heart_Rate | Body_Temp | Calories | Intercept | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 14733363 | 0 | 68 | 190.0 | 94.0 | 29.0 | 105.0 | 40.8 | 231.0 | 1 |
| 1 | 1 | 1 | 14861698 | 1 | 20 | 166.0 | 60.0 | 14.0 | 94.0 | 40.3 | 66.0 | 1 |
| 2 | 2 | 2 | 11179863 | 0 | 69 | 179.0 | 79.0 | 5.0 | 88.0 | 38.7 | 26.0 | 1 |
| 3 | 3 | 3 | 16180408 | 1 | 34 | 179.0 | 71.0 | 13.0 | 100.0 | 40.5 | 71.0 | 1 |
| 4 | 4 | 4 | 17771927 | 1 | 27 | 154.0 | 58.0 | 10.0 | 81.0 | 39.8 | 35.0 | 1 |
dfc.Calories.max()
295.0
Data cleaning and adjusting the classes
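# Gender is already encoded as 0/1 because dfc was built from the cleaned df,
# so no re-mapping is needed here (mapping again would turn every value into NaN).
# mapeo = {'male': 0, 'female': 1}
# dfc['Gender'] = dfc['Gender'].map(mapeo)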
dfc['Calories'] = dfc['Calories'].apply(lambda x: 1 if x >= min_calories else 0)
dfc.Calories.unique()
array([1, 0], dtype=int64)
dfc.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14998 entries, 0 to 14997
Data columns (total 12 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   level_0     14998 non-null  int64
 1   index       14998 non-null  int64
 2   User_ID     14998 non-null  int64
 3   Gender      14998 non-null  int64
 4   Age         14998 non-null  int64
 5   Height      14998 non-null  float64
 6   Weight      14998 non-null  float64
 7   Duration    14998 non-null  float64
 8   Heart_Rate  14998 non-null  float64
 9   Body_Temp   14998 non-null  float64
 10  Calories    14998 non-null  int64
 11  Intercept   14998 non-null  int64
dtypes: float64(5), int64(7)
memory usage: 1.4 MB
We create a simple model to classify with pymc
# Model with a single parameter (the log-odds of reaching the calorie goal)
with pm.Model() as simple_model:
    a = pm.Normal("a", 0, 10)
    p = pm.Deterministic("p", pm.math.invlogit(a))
    # Use the binary target from dfc (1 if Calories >= min_calories, else 0)
    L = pm.Bernoulli(
        "Calories",
        p,
        observed=dfc["Calories"].values
    )
    prior_sample = pm.sample_prior_predictive()
Sampling: [Calories, a]
# We plot the prior density of p
az.plot_density(
    prior_sample["prior"],
    group="prior",
    colors=["k"],
    var_names=["p"],
    point_estimate=None,
)
plt.xlabel("prior probability of reaching the calorie goal")
plt.ylabel("Density")
Text(0, 0.5, 'Density')
The code defines a Bayesian model with a single parameter, a, which has a normal prior with mean 0 and standard deviation 10. This parameter is the log-odds that a session reaches the calorie goal (Calories >= min_calories). The model also defines a deterministic variable, p, which is the inverse logit of a, that is, the probability of reaching the goal implied by a. Finally, the model defines a Bernoulli observation variable, L, which represents the observed data (whether or not each session reached the goal). The probability that L equals 1 is p.
The code then samples from the prior distribution of the model using pm.sample_prior_predictive(). This function generates a set of samples from the prior distribution, which is the distribution of the parameters before any data is observed. The resulting samples are stored in the variable prior_sample.
The plot shows the prior density of p, the probability that a session reaches the calorie goal. A standard deviation of 10 on the log-odds scale places most of the prior mass on large positive or large negative values of a, and the inverse logit maps those values to probabilities close to 1 or 0. The density is therefore expected to pile up near the two extremes rather than around 0.5.
In other words, before seeing any data this model assumes that sessions either almost never or almost always reach the goal.
Such a prior is far more extreme than anything we believe about calorie goals, which is why a narrower prior on a is tried next.
Conclusions that can be drawn from the plot:
# Model with a single parameter (The target)
with pm.Model() as simple_model:
    a = pm.Normal("a", 0, 1.5)
    p = pm.Deterministic("p", pm.math.invlogit(a))
    L = pm.Bernoulli(
        "Calories",
        p,
        observed=df["Calories"].values
    )
    prior_sample = pm.sample_prior_predictive()
Sampling: [Calories, a]
# We plot the prior density of p
az.plot_density(
    prior_sample["prior"],
    group="prior",
    colors=["k"],
    var_names=["p"],
    point_estimate=None,
)
plt.xlabel("prior probability of Calories = 1")
plt.ylabel("Density")
Text(0, 0.5, 'Density')
This code defines a Bayesian model with a single parameter, a, which represents the log-odds of the underlying probability of consuming calories. The model assumes that a follows a normal distribution with a mean of 0 and a standard deviation of 1.5. A log-odds of 0 corresponds to a probability of 0.5, so the prior is centered on 50%. About 68% of the prior mass for a lies between -1.5 and 1.5 (probabilities of roughly 0.18 to 0.82), and about 95% lies between -3 and 3 (roughly 0.05 to 0.95).
The model also defines a deterministic variable, p, which represents the probability of consuming calories given the value of a. This probability is calculated using the inverse logit function, which ensures that p is always between 0 and 1.
Finally, the model defines an observation variable, L, which represents the observed data on whether or not calories were consumed. This variable is modeled as a Bernoulli distribution, which means that it can only take on two values: 0 or 1. The probability of L being 1 is equal to p, which depends on the value of a.
The pm.sample_prior_predictive() function draws samples from the model before any data is used for fitting: values of the parameter a, the implied values of p, and simulated observations. The resulting samples are stored in the variable prior_sample.
The density of p is spread broadly and symmetrically around 0.5. With a standard deviation of 1.5 on the log-odds scale, values of p across most of the interval from 0 to 1 are about equally plausible a priori, and the density falls off only close to 0 and 1.
This makes Normal(0, 1.5) a weakly informative prior: it does not favor any particular probability, but it gives little weight to the assumption that the outcome is almost always 0 or almost always 1.
The plot describes the model's assumptions before it sees the data. It does not show how the prior affects the behavior of the people in the data set; that requires fitting the model and examining the posterior.
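The effect of the prior width on p can be checked numerically. The sketch below compares the two standard deviations discussed in this section (10 and 1.5) by measuring how much prior mass ends up at extreme probabilities. The cut-offs 0.05 and 0.95, the seed, and the helper names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed


def invlogit(x):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))


def extreme_mass(sd, n=100_000):
    """Fraction of prior draws of p that fall below 0.05 or above 0.95."""
    p = invlogit(rng.normal(0.0, sd, size=n))
    return np.mean((p < 0.05) | (p > 0.95))


wide = extreme_mass(10.0)    # a ~ Normal(0, 10)
narrow = extreme_mass(1.5)   # a ~ Normal(0, 1.5)

print(f"sd=10:  {wide:.2f} of the prior mass is at extreme probabilities")
print(f"sd=1.5: {narrow:.2f} of the prior mass is at extreme probabilities")
```

With the wide prior roughly three quarters of the draws land in the extreme regions, while with the narrower prior only about 5% do, which is the sense in which Normal(0, 1.5) is the less opinionated choice on the probability scale.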
Conclusions that can be drawn from the plot:
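We now create the Bayesian classification model using all variables. Note that the code below fits scikit-learn's BayesianRidge, which is a Bayesian linear regression, and rounds its predictions to obtain the class labels 0 and 1.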
X = dfc.drop('Calories', axis=1)
y = dfc['Calories']
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14998 entries, 0 to 14997
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype
---  ------      --------------  -----
 0   level_0     14998 non-null  int64
 1   index       14998 non-null  int64
 2   User_ID     14998 non-null  int64
 3   Gender      14998 non-null  int64
 4   Age         14998 non-null  int64
 5   Height      14998 non-null  float64
 6   Weight      14998 non-null  float64
 7   Duration    14998 non-null  float64
 8   Heart_Rate  14998 non-null  float64
 9   Body_Temp   14998 non-null  float64
 10  Intercept   14998 non-null  int64
dtypes: float64(5), int64(6)
memory usage: 1.3 MB
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the Bayesian Ridge regression model
classifier = BayesianRidge()
classifier.fit(X_train, y_train)
BayesianRidge()
# Make predictions on the test set
y_pred = np.round(classifier.predict(X_test))
# Evaluate the model performances
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy:.2f}')
Accuracy: 0.95
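The rounding step above turns a continuous regression output into a class label. A small sketch with made-up prediction values shows one thing to watch for: np.round can return values outside the label set when the regression output falls below -0.5 or above 1.5, whereas an explicit threshold always yields 0 or 1. The array raw_pred is hypothetical, and the reported results contain only the labels 0 and 1, so this is a general caution rather than a problem observed in this run.

```python
import numpy as np

# Hypothetical continuous outputs of a regression model fitted on 0/1 labels
raw_pred = np.array([-0.6, 0.2, 0.5, 0.8, 1.6])

# Rounding to the nearest integer can leave the {0, 1} label set
rounded = np.round(raw_pred)

# An explicit threshold at 0.5 always produces a valid label
thresholded = (raw_pred > 0.5).astype(int)

print(rounded)       # [-1.  0.  0.  1.  2.]
print(thresholded)   # [0 0 0 1 1]
```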
We will evaluate the model with the F1-Score because we want to penalize both types of misclassification, false positives and false negatives. The F1-Score is a metric commonly used in binary classification problems to evaluate the performance of a model, particularly when there is an imbalance between the classes. It is the harmonic mean of precision and recall and is calculated with the following formula, where P is precision and R is recall: $$F_1 = 2 \cdot \frac{P \cdot R}{P + R}$$
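A short sketch of the formula shows why the harmonic mean is useful here. The function name f1_from is an assumption for this example; the first pair of values matches the class 1 precision and recall reported further down, and the second pair is invented to show a lopsided case.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


balanced = f1_from(0.87, 0.91)   # precision and recall are similar
lopsided = f1_from(1.00, 0.10)   # perfect precision, very poor recall

print(f"balanced: {balanced:.2f}")   # 0.89
print(f"lopsided: {lopsided:.2f}")   # 0.18
```

In the lopsided case the arithmetic mean would be 0.55, but the F1-Score stays close to the weaker of the two values, so a model cannot score well by optimizing only precision or only recall.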
# Print classification report and confusion matrix
print(classification_report(y_test, y_pred))
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Confusion Matrix:\n{conf_matrix}')
              precision    recall  f1-score   support

           0       0.98      0.97      0.97      2405
           1       0.87      0.91      0.89       595

    accuracy                           0.95      3000
   macro avg       0.92      0.94      0.93      3000
weighted avg       0.96      0.95      0.96      3000

Confusion Matrix:
[[2324   81]
 [  55  540]]
# Plot the confusion matrix
plt.figure(figsize=(6, 6))
plt.imshow(conf_matrix, cmap='Blues', interpolation='nearest')
plt.title('Confusion Matrix')
plt.colorbar()
plt.xlabel('Predictions')
plt.ylabel('Actual Values')
plt.show()
For the final conclusions we will compile both the results obtained in the model evaluations and the data obtained in the real-life tests. In addition, we will consider the reliability of the selected methods as well as their main advantages and disadvantages, in order to better understand the actual usefulness of our model.
A Bayesian linear regression model extends classical linear regression by incorporating the Bayesian approach, a statistical framework rooted in Bayesian probability theory. Unlike classical statistics, which treats model parameters as fixed values, Bayesian statistics describes these parameters with probability distributions. In the context of linear regression, this means that instead of providing point estimates for the regression coefficients, the model yields a posterior distribution over them. This brings with it a number of advantages and disadvantages when applied.
Advantages
Incorporation of Prior Information: The model can include prior information about the parameters, which improves the estimates when prior knowledge or expert information is available.
Explicit Uncertainty: Parameter estimates come as probability distributions rather than single values, so the model reports the uncertainty of each estimate, which is useful in situations where certainty is critical.
Natural Regularization: The prior acts as a regularization term, which helps avoid overfitting the model to the training data.
Continuous Updating: As new data is collected, the model can be updated using Bayes' theorem, with the current posterior serving as the prior for the next batch, making it suitable for real-time applications.
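The continuous updating and explicit uncertainty points can be illustrated with the simplest Bayesian linear regression: a Gaussian prior on the coefficients and a known noise variance, for which the posterior has a closed form. The sketch below uses synthetic data; the coefficients, noise variance, prior, and function name are all assumptions for the example and are unrelated to the calorie data set or to the BayesianRidge model used earlier.

```python
import numpy as np


def update(mean, cov, X, y, noise_var):
    """Conjugate update of a Gaussian belief N(mean, cov) on the coefficients."""
    prior_prec = np.linalg.inv(cov)
    post_prec = prior_prec + X.T @ X / noise_var
    post_cov = np.linalg.inv(post_prec)
    post_mean = post_cov @ (prior_prec @ mean + X.T @ y / noise_var)
    return post_mean, post_cov


rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])   # hypothetical coefficients
noise_var = 0.25                 # noise variance, assumed known
X = rng.normal(size=(200, 2))
y = X @ true_w + rng.normal(scale=np.sqrt(noise_var), size=200)

prior_mean, prior_cov = np.zeros(2), 10.0 * np.eye(2)

# Option 1: use all the data at once
batch_mean, batch_cov = update(prior_mean, prior_cov, X, y, noise_var)

# Option 2: update on the first half, then reuse that posterior as the prior
m1, S1 = update(prior_mean, prior_cov, X[:100], y[:100], noise_var)
seq_mean, seq_cov = update(m1, S1, X[100:], y[100:], noise_var)

print("batch mean:     ", batch_mean)
print("sequential mean:", seq_mean)
print("posterior std:  ", np.sqrt(np.diag(batch_cov)))
```

Both routes give the same posterior, and the posterior standard deviations quantify how uncertain each coefficient still is. Models without a closed-form posterior need approximate methods instead, which is where the computational cost listed under the disadvantages comes from.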
Disadvantages
Computationally Intensive: Compared to standard linear models, Bayesian models can be more computationally intensive due to the need to perform calculations on probability distributions.
Complexity in Prior Specification: The choice of prior can be critical and can significantly affect model results, which can be challenging if clear information about the prior distribution is not available.
Potentially Complex Interpretation: Interpretation of probability distributions and associated uncertainty can be more complex for those unfamiliar with Bayesian methods.
Requires Sufficient Data: For significant benefits, a sufficient amount of data is often required to update estimates effectively.
Linear Regression Model
The results of the Bayesian linear regression model establish that it is a good fit for the data, with a high R-squared value of 0.966264. This means that the model can explain about 96.63% of the variation in the data. The mean squared error (MSE) of 129.793875, which corresponds to a root mean squared error of about 11.4 in the units of the response, is also relatively low, which indicates that the model is not making large errors in its predictions.
The scatter plot shows that the predictions of the model are generally close to the actual values, with only a few outliers. This further confirms that the model is a good fit for the data.
Overall, the results of the Bayesian linear regression model suggest that the model can be used to accurately predict the response variable based on the predictor variables.
Here is a more detailed explanation of the results:
Mean squared error (MSE): The MSE is a measure of how well the model fits the data. It is calculated by taking the average of the squared differences between the actual values and the predicted values. A lower MSE indicates a better fit.
R-squared (R2): The R-squared value is a measure of how much of the variation in the data is explained by the model. It is calculated as the proportion of the variance in the actual values that can be explained by the model. A higher R-squared value indicates a better fit.
For the Bayesian linear regression model trained here, the MSE is relatively low and the R-squared value is high. This suggests that the model is a good fit for the data and can be used to accurately predict the response variable based on the predictor variables.
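Both metrics are simple to compute by hand. The sketch below does so on a few made-up values (y_true and y_pred are invented for illustration and are not taken from the project data) and also converts the reported MSE into a root mean squared error, which is easier to read because it is in the units of the response.

```python
import numpy as np


def mse(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    return np.mean((y_true - y_pred) ** 2)


def r_squared(y_true, y_pred):
    """Share of the variance in y_true that the predictions account for."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot


# Invented values, for illustration only
y_true = np.array([10.0, 50.0, 120.0, 200.0])
y_pred = np.array([15.0, 45.0, 110.0, 205.0])

print(f"MSE: {mse(y_true, y_pred):.2f}")        # 43.75
print(f"R2:  {r_squared(y_true, y_pred):.4f}")  # 0.9916

# RMSE implied by the MSE reported for the model in this section
reported_rmse = np.sqrt(129.793875)
print(f"Reported RMSE: {reported_rmse:.2f}")    # 11.39
```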
Logistic Regression Model
The confusion matrix for the Bayesian classification model (fitted above with BayesianRidge and rounded predictions) shows that the model correctly classifies the majority of the data points, with an overall accuracy of 95%. Precision and recall are 0.98 and 0.97 for class 0 and 0.87 and 0.91 for class 1, so the model identifies both negative and positive cases well, although it is somewhat less precise on the smaller positive class.
The F1-score is a measure of the overall performance of a classification model, and it takes into account both precision and recall. The macro-averaged F1-score of the model is also high, at 0.93 (0.97 for class 0 and 0.89 for class 1). This further confirms that the model is performing well on this classification task.
Overall, the results of the Bayesian classification model suggest that it is a good fit for the data and can be used to accurately classify new data points.
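The reported figures can be reproduced directly from the confusion matrix printed above. The sketch below assumes the scikit-learn convention that rows are actual classes and columns are predicted classes; the variable names are chosen for this example.

```python
import numpy as np

# Confusion matrix from the evaluation above (rows: actual, columns: predicted)
conf_matrix = np.array([[2324, 81],
                        [55, 540]])

correct = np.diag(conf_matrix)                 # correctly classified per class
precision = correct / conf_matrix.sum(axis=0)  # divide by predicted totals
recall = correct / conf_matrix.sum(axis=1)     # divide by actual totals
f1 = 2 * precision * recall / (precision + recall)
support = conf_matrix.sum(axis=1)              # number of actual cases per class

accuracy = correct.sum() / conf_matrix.sum()
macro_f1 = f1.mean()                           # both classes count equally
weighted_f1 = np.average(f1, weights=support)  # classes weighted by size

print(f"per-class F1: {f1.round(2)}")
print(f"accuracy:     {accuracy:.2f}")
print(f"macro F1:     {macro_f1:.2f}")
print(f"weighted F1:  {weighted_f1:.2f}")
```

The macro average treats both classes equally, so it is pulled down by the weaker class 1 score, while the weighted average is dominated by the much larger class 0. With about four times as many class 0 cases as class 1 cases, the per-class and macro figures are the more informative ones to report.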
Here is a more detailed explanation of the metrics in the confusion matrix: